CSMAD21 Coursework - 31837482 23/24¶

Module Code: CSMAD21
Assignment Report Title: Metro and Seeds Data Analysis
Date (when work was completed): 06/12/2023
Actual hours spent on assignment: 50


Task List¶

  • Task 1: Bicycle Journey Data
  • Task 2: Seed Shape Analysis


Task 1: Bicycle Journey Data

Before I begin¶

I browsed the data provider's website to learn more about the data source, its meaning, and its purpose, and to form a rough initial picture of the information it contains.

Data Preprocessing¶

Import the pandas and numpy libraries for data manipulation, and matplotlib for plotting.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

At first, I set low_memory=False so that pandas reads the file in one pass and infers a single consistent dtype per column, avoiding the mixed-types warning. Since this consumes more memory, I reverted to the default read and used the info() function to summarise the DataFrame.

In [2]:
#metro = pd.read_csv("datasets/metro.csv", low_memory=False) 
metro = pd.read_csv("datasets/metro.csv")

# Take an overview of the data
metro.info() 

# Check missing data
metro.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92124 entries, 0 to 92123
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   trip_id              92124 non-null  int64  
 1   duration             92124 non-null  int64  
 2   start_time           92124 non-null  object 
 3   end_time             92124 non-null  object 
 4   start_station        92124 non-null  int64  
 5   start_lat            89985 non-null  float64
 6   start_lon            89985 non-null  float64
 7   end_station          92124 non-null  int64  
 8   end_lat              88052 non-null  float64
 9   end_lon              88052 non-null  float64
 10  bike_id              92124 non-null  object 
 11  plan_duration        92124 non-null  int64  
 12  trip_route_category  92124 non-null  object 
 13  passholder_type      92124 non-null  object 
 14  bike_type            92124 non-null  object 
dtypes: float64(4), int64(5), object(6)
memory usage: 10.5+ MB
/var/folders/q7/8l57xp494s5832pk1vz3_4tc0000gn/T/ipykernel_59877/1147889363.py:2: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  metro = pd.read_csv("datasets/metro.csv")
Out[2]:
trip_id                   0
duration                  0
start_time                0
end_time                  0
start_station             0
start_lat              2139
start_lon              2139
end_station               0
end_lat                4072
end_lon                4072
bike_id                   0
plan_duration             0
trip_route_category       0
passholder_type           0
bike_type                 0
dtype: int64

The result list shows that:

  • The latitude and longitude columns have missing values.
  • Column 10, bike_id, contains mixed data types.

Check which data types appear in bike_id and their counts¶

In [3]:
metro.bike_id.apply(type).value_counts()
Out[3]:
bike_id
<class 'str'>    65536
<class 'int'>    26588
Name: count, dtype: int64

Numbers can safely be converted to strings, but not always the other way around, so ID-type data is better stored as strings.

When reading the data, specify the dtype of bike_id and trip_id as string. Since trip_id is a unique identifier, it can be treated as a string.

To simplify the dataset, the latitude, longitude, and bike_id columns will be removed during the cleaning stage. Convert start_time and end_time to datetime format for easier manipulation.

In [4]:
dtype_spec = {"bike_id": "string","trip_id": "string"}

metro = pd.read_csv("datasets/metro.csv", dtype = dtype_spec) 
metro['start_time'] = pd.to_datetime(metro['start_time'])
metro['end_time'] = pd.to_datetime(metro['end_time'])

metro.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92124 entries, 0 to 92123
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   trip_id              92124 non-null  string        
 1   duration             92124 non-null  int64         
 2   start_time           92124 non-null  datetime64[ns]
 3   end_time             92124 non-null  datetime64[ns]
 4   start_station        92124 non-null  int64         
 5   start_lat            89985 non-null  float64       
 6   start_lon            89985 non-null  float64       
 7   end_station          92124 non-null  int64         
 8   end_lat              88052 non-null  float64       
 9   end_lon              88052 non-null  float64       
 10  bike_id              92124 non-null  string        
 11  plan_duration        92124 non-null  int64         
 12  trip_route_category  92124 non-null  object        
 13  passholder_type      92124 non-null  object        
 14  bike_type            92124 non-null  object        
dtypes: datetime64[ns](2), float64(4), int64(4), object(3), string(2)
memory usage: 10.5+ MB

To allow better control over the time data, 'start_time' and 'end_time' are converted to datetime. During the data cleaning stage, columns that are less necessary for the analysis, such as latitude, longitude, bike_id, and trip_id, will be removed with reference to the data source documentation.

To obtain summary statistics of the data, use the describe() function.¶

In [5]:
metro.describe()
Out[5]:
duration start_time end_time start_station start_lat start_lon end_station end_lat end_lon plan_duration
count 92124.000000 92124 92124 92124.000000 89985.000000 89985.000000 92124.000000 88052.000000 88052.000000 92124.000000
mean 33.168588 2019-08-17 07:30:19.200208640 2019-08-17 08:13:36.216621056 3484.899690 34.034786 -118.287893 3480.271026 34.034895 -118.286699 60.290977
min 1.000000 2019-07-01 00:04:00 2019-07-01 00:09:00 3000.000000 33.710979 -118.495422 3000.000000 33.710979 -118.495422 1.000000
25% 6.000000 2019-07-26 06:13:15 2019-07-26 07:08:15 3029.000000 34.035801 -118.281181 3028.000000 34.037048 -118.280952 1.000000
50% 12.000000 2019-08-18 11:05:00 2019-08-18 12:54:00 3062.000000 34.046810 -118.258537 3062.000000 34.046810 -118.258537 30.000000
75% 22.000000 2019-09-08 16:21:00 2019-09-08 16:54:15 4285.000000 34.051941 -118.248253 4285.000000 34.051941 -118.248253 30.000000
max 1440.000000 2019-09-30 23:58:00 2019-10-02 12:04:00 4453.000000 34.177662 -118.231277 4453.000000 34.177662 -118.231277 999.000000
std 129.057841 NaN NaN 611.483883 0.058803 0.073501 609.942741 0.058790 0.072628 111.141364

The table displays the statistics of the time and duration fields, which reveal important information such as the mean, minimum, maximum, and quartile values. However, the station columns are categorical data represented by numbers, which can lead to misleading statistics. To avoid this, the station columns are converted to strings.

In [6]:
# transform to string
metro['start_station'] = metro['start_station'].apply(str)
metro['end_station'] = metro['end_station'].apply(str)

metro.info()

# the summary statistics now cover numeric and datetime columns only
metro.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92124 entries, 0 to 92123
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   trip_id              92124 non-null  string        
 1   duration             92124 non-null  int64         
 2   start_time           92124 non-null  datetime64[ns]
 3   end_time             92124 non-null  datetime64[ns]
 4   start_station        92124 non-null  object        
 5   start_lat            89985 non-null  float64       
 6   start_lon            89985 non-null  float64       
 7   end_station          92124 non-null  object        
 8   end_lat              88052 non-null  float64       
 9   end_lon              88052 non-null  float64       
 10  bike_id              92124 non-null  string        
 11  plan_duration        92124 non-null  int64         
 12  trip_route_category  92124 non-null  object        
 13  passholder_type      92124 non-null  object        
 14  bike_type            92124 non-null  object        
dtypes: datetime64[ns](2), float64(4), int64(2), object(5), string(2)
memory usage: 10.5+ MB
Out[6]:
duration start_time end_time start_lat start_lon end_lat end_lon plan_duration
count 92124.000000 92124 92124 89985.000000 89985.000000 88052.000000 88052.000000 92124.000000
mean 33.168588 2019-08-17 07:30:19.200208640 2019-08-17 08:13:36.216621056 34.034786 -118.287893 34.034895 -118.286699 60.290977
min 1.000000 2019-07-01 00:04:00 2019-07-01 00:09:00 33.710979 -118.495422 33.710979 -118.495422 1.000000
25% 6.000000 2019-07-26 06:13:15 2019-07-26 07:08:15 34.035801 -118.281181 34.037048 -118.280952 1.000000
50% 12.000000 2019-08-18 11:05:00 2019-08-18 12:54:00 34.046810 -118.258537 34.046810 -118.258537 30.000000
75% 22.000000 2019-09-08 16:21:00 2019-09-08 16:54:15 34.051941 -118.248253 34.051941 -118.248253 30.000000
max 1440.000000 2019-09-30 23:58:00 2019-10-02 12:04:00 34.177662 -118.231277 34.177662 -118.231277 999.000000
std 129.057841 NaN NaN 0.058803 0.073501 0.058790 0.072628 111.141364

Although the table is now clearer, the latitude and longitude statistics are still not meaningful and will be processed later.

To get a first comprehensive look at the data¶

Show the first 5 rows for a brief overview.¶

In [7]:
metro.head()
Out[7]:
trip_id duration start_time end_time start_station start_lat start_lon end_station end_lat end_lon bike_id plan_duration trip_route_category passholder_type bike_type
0 124657107 5 2019-07-01 00:04:00 2019-07-01 00:09:00 4312 34.066990 -118.290878 4410 34.063351 -118.296799 6168 30 One Way Monthly Pass standard
1 124657587 9 2019-07-01 00:07:00 2019-07-01 00:16:00 3066 34.063389 -118.236160 3066 34.063389 -118.236160 17584 30 Round Trip Monthly Pass electric
2 124658068 5 2019-07-01 00:20:00 2019-07-01 00:25:00 4410 34.063351 -118.296799 4312 34.066990 -118.290878 18920 30 One Way Monthly Pass electric
3 124659747 20 2019-07-01 00:44:00 2019-07-01 01:04:00 3045 34.028511 -118.256668 4275 34.012520 -118.285896 6016 1 One Way Walk-up standard
4 124660227 27 2019-07-01 00:44:00 2019-07-01 01:11:00 3035 34.048401 -118.260948 3049 34.056969 -118.253593 5867 30 One Way Monthly Pass standard

Find distinct values for plan_duration, trip_route_category, passholder_type, and bike_type.

Create a bar plot to visualize the results for qualitative data.¶

In [8]:
plt.figure(figsize=(16,8))

plt.subplot(2,2,1)
# use value_counts() for both heights and labels so bars and
# tick labels stay aligned (unique() may be in a different order)
plan_count = metro["plan_duration"].value_counts()
y_pos = np.arange(len(plan_count))
plt.bar(y_pos, plan_count)
plt.xticks(y_pos, plan_count.index)
plt.xlabel('Plan Duration')
plt.ylabel('Count')

plt.subplot(2,2,2)
route_count = metro["trip_route_category"].value_counts()
y_pos = np.arange(len(route_count))
plt.bar(y_pos, route_count)
plt.xticks(y_pos, route_count.index)
plt.xlabel('Trip Route Category')
plt.ylabel('Count')

plt.subplot(2,2,3)
passholder_count = metro["passholder_type"].value_counts()
y_pos = np.arange(len(passholder_count))
plt.bar(y_pos, passholder_count)
plt.xticks(y_pos, passholder_count.index)
plt.xlabel('Passholder Type')
plt.ylabel('Count')

plt.subplot(2,2,4)
bike_count = metro["bike_type"].value_counts()
y_pos = np.arange(len(bike_count))
plt.bar(y_pos, bike_count)
plt.xticks(y_pos, bike_count.index)
plt.xlabel('Bike Type')
plt.ylabel('Count')

plt.show()

The set of bar plots indicates the proportion of each attribute's categorical values.

It can be seen initially that the consumption trends are as follows:

  • The 30-day plan duration is the most popular choice.
  • One-way trips far outnumber round trips.
  • Most consumers use monthly passes.
  • The bicycle types rank in order of use: standard, electric, smart.

The counts for "Flex Pass" and "Testing" are too small relative to the other categories and may be barely visible on the bar charts.

Check the count of each category first¶

In [9]:
print(metro.groupby('passholder_type')['duration'].size())
passholder_type
Annual Pass      6220
Flex Pass           6
Monthly Pass    57175
One Day Pass     5175
Testing            46
Walk-up         23502
Name: duration, dtype: int64

Using a logarithmic scale for the Y-axis emphasizes low-count categories but makes the chart less intuitive to read.¶

In [10]:
passholder_count = metro["passholder_type"].value_counts()
y_pos = np.arange(len(passholder_count))

plt.bar(y_pos, passholder_count)
plt.xticks(y_pos, passholder_count.index)  # labels aligned with the counts
plt.xlabel('Passholder Type')
plt.ylabel('Count')
plt.yscale('log')
plt.show()

The chart confirms that "Testing" records exist. For the analysis, these records should be excluded; this is handled later in the data cleaning stage.

Data Cleaning¶

Count duplicated rows to confirm there are none.¶

In [11]:
metro.duplicated().sum()
Out[11]:
0

No duplicated data.

Remove irrelevant fields for the reasons mentioned above¶

In [12]:
columns_to_remove = ['trip_id' ,'start_lat', 'start_lon', 'end_lat', 'end_lon', 'bike_id']
clean_metro = metro.drop(columns=columns_to_remove, axis=1)
clean_metro.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92124 entries, 0 to 92123
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   duration             92124 non-null  int64         
 1   start_time           92124 non-null  datetime64[ns]
 2   end_time             92124 non-null  datetime64[ns]
 3   start_station        92124 non-null  object        
 4   end_station          92124 non-null  object        
 5   plan_duration        92124 non-null  int64         
 6   trip_route_category  92124 non-null  object        
 7   passholder_type      92124 non-null  object        
 8   bike_type            92124 non-null  object        
dtypes: datetime64[ns](2), int64(2), object(5)
memory usage: 6.3+ MB

To exclude the testing data, conditionally delete rows where passholder_type is "Testing".

Check the distinct data left which are 'Monthly Pass', 'Walk-up', 'Annual Pass', 'One Day Pass' and 'Flex Pass'.¶

In [13]:
clean_metro = clean_metro.drop(clean_metro[clean_metro['passholder_type'] == "Testing"].index)
clean_metro["passholder_type"].unique()
Out[13]:
array(['Monthly Pass', 'Walk-up', 'Annual Pass', 'One Day Pass',
       'Flex Pass'], dtype=object)

Count missing values of each attribute and list the fields¶

In [14]:
clean_metro.isnull().sum()
Out[14]:
duration               0
start_time             0
end_time               0
start_station          0
end_station            0
plan_duration          0
trip_route_category    0
passholder_type        0
bike_type              0
dtype: int64

No missing values remain: the latitude and longitude attributes, which contained the missing data, were considered unimportant for this analysis and have been removed.

Now check the correlation table of the clean data¶

In [15]:
# numeric data only
clean_numeric_columns = clean_metro.select_dtypes(include=[np.number])
clean_correlation_matrix = clean_numeric_columns.corr()
clean_correlation_matrix
Out[15]:
duration plan_duration
duration 1.000000 0.041844
plan_duration 0.041844 1.000000

A correlation coefficient of 0.041844 indicates a very weak positive correlation between duration and plan_duration. Any decision based on this analysis should therefore draw on other data and methods as well.
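As a supplementary check (not part of the original analysis), a rank-based Spearman correlation can be computed the same way via `DataFrame.corr(method=...)`; a minimal sketch on made-up values:

```python
import pandas as pd

# Small synthetic frame standing in for the cleaned metro data
# (values are illustrative, not the real dataset).
df = pd.DataFrame({
    "duration": [5, 12, 22, 40, 8, 15],
    "plan_duration": [1, 30, 30, 365, 1, 30],
})

# Pearson measures linear association; Spearman measures monotonic
# association and is less sensitive to skewed values like plan_duration.
pearson = df.corr(method="pearson").loc["duration", "plan_duration"]
spearman = df.corr(method="spearman").loc["duration", "plan_duration"]
print(round(pearson, 3), round(spearman, 3))
```

If the two coefficients disagree strongly, the relationship is likely non-linear or driven by extreme values.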

Identification, explanation, and handling of missing data, outliers, and other irregularities¶

Missing Data

Because the missing latitude and longitude values are station-dependent, they can be recovered from the JSON station data provided by the website's API whenever the station ID is present. Should the coordinates be needed for later analysis, the gaps can be completed with this supplementary information.
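A minimal sketch of this fill-in idea, using a hypothetical `station_coords` lookup (the station IDs and coordinates below are made up, standing in for the provider's station-information JSON):

```python
import pandas as pd

# Hypothetical station -> (lat, lon) lookup, as would be built from the
# provider's station-information JSON.
station_coords = {
    3005: (34.0485, -118.2587),
    3066: (34.0634, -118.2362),
}

trips = pd.DataFrame({
    "start_station": [3005, 3066, 3005],
    "start_lat": [34.0485, None, None],
    "start_lon": [-118.2587, None, None],
})

# Fill missing coordinates from the station lookup.
trips["start_lat"] = trips["start_lat"].fillna(
    trips["start_station"].map(lambda s: station_coords[s][0]))
trips["start_lon"] = trips["start_lon"].fillna(
    trips["start_station"].map(lambda s: station_coords[s][1]))

print(trips["start_lat"].isna().sum())  # 0
```

This only works because each station ID maps to a fixed location; genuinely station-independent gaps would need a different strategy.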

Outliers

Identify outliers by using boxplot function¶

In [16]:
clean_metro.boxplot(column='duration')
plt.show()
In [17]:
# calculate Inter Quartile Range
Q1 = clean_metro['duration'].quantile(0.25)
Q3 = clean_metro['duration'].quantile(0.75)
IQR = Q3 - Q1
print('Q3: ',Q3)
print('Q1: ',Q1)
print('IQR:',IQR)
Q3:  22.0
Q1:  6.0
IQR: 16.0
In [18]:
# calculate upper and lower limit
upper_limit = Q3 + 1.5*IQR
lower_limit = Q1 - 1.5*IQR
print('upper limit:', upper_limit)
print('lower limit:', lower_limit)
upper limit: 46.0
lower limit: -18.0

Remove outliers¶

In [19]:
clean_metro_no_outlier = clean_metro[clean_metro['duration'] < upper_limit]

Use seaborn to compare the data before and after outlier removal¶

In [20]:
import seaborn as sns

print("Before removing outliers")
plt.figure(figsize=(16,8))
plt.subplot(2,2,1)
sns.histplot(clean_metro['duration'], kde=True)

plt.subplot(2,2,2)
sns.boxplot(data = clean_metro['duration'])
plt.show()

print("After removing outliers")
plt.figure(figsize=(16,8))
plt.subplot(2,2,1)
sns.histplot(clean_metro_no_outlier['duration'], kde=True)

plt.subplot(2,2,2)
sns.boxplot(data = clean_metro_no_outlier['duration'])

plt.show()
Before removing outliers
After removing outliers

Before removing outliers: the duration variable is highly skewed, with a long distribution tail caused by a few extremely high values. The boxplot shows outliers with very high values.

After removing outliers: the cleaned data has a more symmetrical distribution, making trends easier to observe. The boxplot shows fewer outliers and shorter whiskers, making the median and interquartile range more visible. The remaining dataset supports more reliable analysis by reducing the impact of extreme values on statistics such as the mean.
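The "more symmetrical" claim can be quantified with the skewness statistic; a minimal sketch of the IQR trim on a toy duration series (values are illustrative, not the real data):

```python
import pandas as pd

# Toy duration series with a long right tail.
durations = pd.Series([5, 6, 8, 10, 12, 15, 20, 22, 30, 45, 300, 1440])

# Same IQR rule as used on the metro data.
q1, q3 = durations.quantile(0.25), durations.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)
trimmed = durations[durations < upper]

# Sample skewness drops once the extreme values are removed.
print(round(durations.skew(), 2), round(trimmed.skew(), 2))
```

A skewness near 0 indicates a roughly symmetric distribution, so comparing the two values before and after trimming backs up the visual impression from the histograms.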

Save cleaned data¶

In [21]:
clean_metro_no_outlier.to_csv('datasets/metro_cleaned.csv', index=False)

Read cleaned data¶

In [22]:
dtype_spec = {"start_station": "string","end_station": "string",
              "trip_route_category": "string","passholder_type": "string",
              "bike_type": "string"}

bicycle = pd.read_csv("datasets/metro_cleaned.csv", dtype = dtype_spec) 

bicycle['start_time'] = pd.to_datetime(bicycle['start_time'])
bicycle['end_time'] = pd.to_datetime(bicycle['end_time'])

bicycle.info()
bicycle.head()
bicycle.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83611 entries, 0 to 83610
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   duration             83611 non-null  int64         
 1   start_time           83611 non-null  datetime64[ns]
 2   end_time             83611 non-null  datetime64[ns]
 3   start_station        83611 non-null  string        
 4   end_station          83611 non-null  string        
 5   plan_duration        83611 non-null  int64         
 6   trip_route_category  83611 non-null  string        
 7   passholder_type      83611 non-null  string        
 8   bike_type            83611 non-null  string        
dtypes: datetime64[ns](2), int64(2), string(5)
memory usage: 5.7 MB
Out[22]:
duration start_time end_time plan_duration
count 83611.000000 83611 83611 83611.000000
mean 13.111433 2019-08-17 17:10:09.504012544 2019-08-17 17:23:16.189975040 60.837689
min 1.000000 2019-07-01 00:04:00 2019-07-01 00:09:00 1.000000
25% 6.000000 2019-07-26 12:36:00 2019-07-26 12:46:30 30.000000
50% 10.000000 2019-08-18 19:24:00 2019-08-18 19:43:00 30.000000
75% 19.000000 2019-09-09 11:46:00 2019-09-09 11:59:30 30.000000
max 45.000000 2019-09-30 23:58:00 2019-10-01 00:03:00 365.000000
std 9.497429 NaN NaN 108.156262

Explanatory data visualisations¶

Use crosstab to get a first cross-tabulated view of the data¶

In [23]:
pd.crosstab(bicycle['passholder_type'], bicycle['duration'])
Out[23]:
duration 1 2 3 4 5 6 7 8 9 10 ... 36 37 38 39 40 41 42 43 44 45
passholder_type
Annual Pass 246 258 468 560 574 484 376 333 353 313 ... 5 4 6 6 1 0 4 1 0 2
Flex Pass 0 0 0 0 1 0 2 2 1 0 ... 0 0 0 0 0 0 0 0 0 0
Monthly Pass 1396 1776 3482 4352 4454 4124 3693 2988 2686 2544 ... 92 79 74 66 53 49 54 40 40 37
One Day Pass 219 56 57 64 60 74 87 107 99 115 ... 40 40 41 60 36 32 36 30 26 36
Walk-up 695 167 227 332 500 631 748 725 733 728 ... 231 204 164 164 189 172 157 144 124 139

5 rows × 45 columns

To get a basic understanding of the travel habits of different passholder types, cross-tabulate them and visualize the result.¶

In [24]:
crosstab = pd.crosstab(bicycle['trip_route_category'], bicycle['passholder_type'])

# visualization
import matplotlib.pyplot as plt

crosstab.plot(kind = "bar", stacked = True, figsize = (6, 4))

plt.title("Trip Route Category vs Passholder Type")
plt.xlabel("Trip Route Category")
plt.ylabel("Number of Trips")
plt.xticks(rotation=0)  # Rotate labels to make them readable

plt.show()

Most monthly pass holders make one-way trips, possibly reflecting regular commutes between different stations. The "Flex Pass" shows very little usage; in the "Round Trip" category it is barely recorded at all.
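Because counts differ so much between pass types, a proportion view can make small categories comparable; a minimal sketch using `pd.crosstab` with its `normalize` parameter on a tiny made-up sample:

```python
import pandas as pd

# Tiny illustrative sample (not the real data).
df = pd.DataFrame({
    "trip_route_category": ["One Way", "One Way", "Round Trip",
                            "One Way", "Round Trip"],
    "passholder_type": ["Monthly Pass", "Walk-up", "Monthly Pass",
                        "Monthly Pass", "Walk-up"],
})

# normalize='index' expresses each row as proportions summing to 1,
# so the passholder mix can be compared across route categories.
prop = pd.crosstab(df["trip_route_category"], df["passholder_type"],
                   normalize="index")
print(prop.round(2))
```

Applied to the full dataset, this would show the passholder mix per route category regardless of how lopsided the absolute counts are.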

To get insights into the popularity of different bike types for one-way and round-trip journeys by visualization.¶

In [25]:
crosstab_bike_type = pd.crosstab(bicycle['bike_type'],
                                 bicycle['trip_route_category'])

crosstab_bike_type.plot(kind='bar', stacked=True, figsize=(6, 4))

plt.title('Bike Type vs Trip Route Category')
plt.xlabel('Bike Type')
plt.ylabel('Number of Trips')
plt.xticks(rotation=0)  # Rotate labels to make them readable

plt.show()

Standard bicycles are the most popular and most commonly used travel tool. Electric and smart bicycles see lower uptake, which may reflect limited market acceptance, or price and availability factors influencing users' choices.

Reduction and/or transformation (e.g., on the time fields)¶

Illustrate the number of bicycle journeys according to journey duration, differentiated by journey type (one way or round trip).¶

In [26]:
bins = [0, 15, 30, float('inf')]
labels = ['Short (<15 min)', 'Medium (15-30 min)', 'Long (>30 min)']

bicycle['duration_group'] = pd.cut(bicycle['duration'], bins=bins, labels=labels)

crosstab_bike_type = pd.crosstab(bicycle['duration_group'],
                                 bicycle['trip_route_category'])

crosstab_bike_type.plot(kind='bar', stacked=True, figsize=(6, 4))

plt.title('Duration Group vs Trip Route Category')
plt.xlabel('Duration Group')
plt.ylabel('Number of Trips')
plt.xticks(rotation=0)

plt.show()

Binning the durations shows that short trips are more popular than medium and long ones, implying that most users rely on bike sharing for short-distance travel. One-way trips outnumber round trips across all three duration groups, indicating that users mostly ride from one point to another rather than returning to the start. The number of long trips is relatively low, which could be because bicycles suit short journeys better, or because users opt for other transport modes over longer distances.

Show the usage of different types of bicycles (standard, smart, electric) in different journey duration groupings (short, medium, long).¶

In [27]:
crosstab_bike_type = pd.crosstab(bicycle['duration_group'],
                                 bicycle['bike_type'])

crosstab_bike_type.plot(kind='bar', stacked=True, figsize=(6, 4))

plt.title('Duration Group vs Bike Type')
plt.xlabel('Duration Group')
plt.ylabel('Number of Trips')  # the y-axis is a count, not "Bike Type"
plt.xticks(rotation=0)  # keep labels horizontal for readability
plt.show()

Based on the data, it appears that people tend to use regular bicycles more frequently for short trips as they are easily accessible and affordable. However, for longer journeys, e-bikes with their assisted power are a more desirable option. Bike-sharing system operators can utilize this information to optimize bike allocation and maintenance strategies, and adjust the supply of bikes based on the varying needs for journey duration.

Feature transformation: Extract the hour of the day and the day of the week from 'start_time'¶

In [28]:
bicycle['start_hour'] = bicycle['start_time'].dt.hour
bicycle['start_day_of_week'] = bicycle['start_time'].dt.day_name()
bicycle.head()
Out[28]:
duration start_time end_time start_station end_station plan_duration trip_route_category passholder_type bike_type duration_group start_hour start_day_of_week
0 5 2019-07-01 00:04:00 2019-07-01 00:09:00 4312 4410 30 One Way Monthly Pass standard Short (<15 min) 0 Monday
1 9 2019-07-01 00:07:00 2019-07-01 00:16:00 3066 3066 30 Round Trip Monthly Pass electric Short (<15 min) 0 Monday
2 5 2019-07-01 00:20:00 2019-07-01 00:25:00 4410 4312 30 One Way Monthly Pass electric Short (<15 min) 0 Monday
3 20 2019-07-01 00:44:00 2019-07-01 01:04:00 3045 4275 1 One Way Walk-up standard Medium (15-30 min) 0 Monday
4 27 2019-07-01 00:44:00 2019-07-01 01:11:00 3035 3049 30 One Way Monthly Pass standard Medium (15-30 min) 0 Monday

Show the distribution of journey duration for each day of the week.¶

In [29]:
sns.violinplot(x='start_day_of_week', y='duration', data=bicycle)
plt.title('Duration Distribution by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Duration (minutes)')
plt.show()

The duration of journeys appears to be quite consistent throughout the week, with a median duration of between 10 and 15 minutes. The daily distribution of journey durations has a similar shape, indicating that users' usage patterns are relatively consistent regardless of the day of the week. The violin plots for Saturday and Sunday are slightly wider than on weekdays, which suggests that journey durations may vary more on weekends, or that users may tend to take slightly longer journeys. The extended tail of the violin plot shows that there are longer journey durations on some days, but these are not necessarily outliers and may simply be less common data points within the normal range.
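The "wider weekend violin" observation can be checked numerically with a group-wise spread; a minimal sketch on made-up durations tagged with day of week:

```python
import pandas as pd

# Illustrative durations tagged with day of week (values are made up):
# a tight weekday cluster versus a spread-out weekend one.
df = pd.DataFrame({
    "start_day_of_week": ["Monday"] * 4 + ["Saturday"] * 4,
    "duration": [8, 10, 11, 12, 5, 12, 25, 40],
})

# The per-day standard deviation quantifies how much journey
# durations vary on each day.
spread = df.groupby("start_day_of_week")["duration"].std()
print(spread.round(2))
```

A larger standard deviation on weekend days would confirm that journey durations vary more then, matching the shape of the violins.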

To see the usage of different passholder_types by day of the week.¶

In [30]:
pd.crosstab(bicycle['start_day_of_week'], bicycle['passholder_type']).plot(
    kind='bar', stacked=True, figsize=(6, 3))
plt.title('Usage by Passholder Type and Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Number of Trips')
# pass figsize to plot() directly; calling plt.figure() after plotting
# only creates a new empty figure
plt.show()

Insights into pass usage behavior are critical for operational planning and marketing strategies. For instance, if a bike-sharing service aims to boost weekend usage, it may need to create promotions that target weekend users. It's worth noting that monthly ticket holders are the most common user group, so service providers may want to focus on enhancing user satisfaction and loyalty among this group.

Get the relationship between the hour of day when the journey starts and the duration of the journey (Duration).¶

In [31]:
plt.figure(figsize=(6, 3))  # create the figure before plotting into it
sns.scatterplot(x='start_hour', y='duration', data=bicycle)
plt.title('Duration by Start Hour of Trip')
plt.xlabel('Hour of Day')
plt.ylabel('Duration (minutes)')
plt.show()

The distribution of trip duration is relatively consistent throughout the day, with no obvious trend of a significant increase or decrease within any specific period. Most trips last less than 30 minutes, which may be because the bike-sharing system is designed for short trips or priced on a metered structure. The scatter plot shows denser departures during certain periods of the day, which could be peak or otherwise busy times; the density of points helps identify these cycles of activity.

Display the top 10 most popular starting sites in bike-sharing systems.¶

In [32]:
bicycle['start_station'].value_counts().head(10).plot(kind='bar', figsize=(6, 3))
plt.title('Top 10 Most Popular Start Stations')
plt.xlabel('Station ID')
plt.ylabel('Number of Trips Started')
plt.show()

The chart shows the number of journeys started from each station, with station IDs on the X-axis ordered by descending count. Station 3005 is the most popular starting point, and usage is far from uniform across stations. Understanding these usage patterns is important for maintenance and city planning; operators should ensure high-traffic stations have enough bikes and staff during peak times.

Represent the route traffic between different start stations and end stations.¶

In [33]:
import seaborn as sns
import matplotlib.pyplot as plt

route_counts = bicycle.groupby(['start_station', 'end_station']).size().reset_index(name='count')
pivot_table = route_counts.pivot(index='start_station', columns='end_station', values='count')

plt.figure(figsize=(8,6))
sns.heatmap(pivot_table, cmap="YlGnBu")
plt.title('Heatmap of Route Traffic')
plt.xlabel('End Station')
plt.ylabel('Start Station')
plt.show()

In the heat map, the color shade indicates the number of journeys with lighter shades representing fewer journeys and darker shades representing more journeys.

The bike share route traffic heat map helps operators identify popular stations and routes for targeted operational decisions. However, analyzing this data with other factors like location, population density, and transportation methods is crucial for a comprehensive understanding.

Summary and more granular statistics¶

Show bicycle usage over time: Number of bicycle journeys per day between July and September 2019.¶

In [34]:
plt.figure(figsize=(6, 4))
plt.title('Daily Trips Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Trips')
bicycle.resample('D', on='start_time').size().plot()
Out[34]:
<Axes: title={'center': 'Daily Trips Over Time'}, xlabel='start_time', ylabel='Number of Trips'>

There is a certain pattern to the fluctuations in the data, which may be due to differences between weekdays and weekends, or regular changes caused by certain daily activities. It appears that the data fluctuations are centered around a specific level and exhibit a certain periodicity.
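The suspected weekly periodicity can be separated from the underlying level with a 7-day rolling mean; a minimal sketch on synthetic daily counts with a built-in weekday/weekend cycle (the numbers are made up):

```python
import pandas as pd

# Synthetic daily trip counts: 900 on weekdays, 1100 on weekends.
idx = pd.date_range("2019-07-01", periods=28, freq="D")
counts = pd.Series(900 + 200 * (idx.dayofweek >= 5), index=idx)

# A 7-day rolling mean averages over exactly one weekly cycle,
# smoothing out the weekday/weekend fluctuation and exposing the
# underlying level of usage.
smoothed = counts.rolling(window=7).mean()
print(smoothed.dropna().round(1).iloc[0])
```

On the real series, `bicycle.resample('D', on='start_time').size().rolling(7).mean()` would play the same role, making any longer-term trend between July and September easier to see.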

How the duration variable changes between each journey's starting hour and day of the week¶

Convert start_time to datetime¶

In [35]:
bicycle['start_time'] = pd.to_datetime(bicycle['start_time'])

bicycle['start_hour'] = bicycle['start_time'].dt.hour
bicycle['day_of_week'] = bicycle['start_time'].dt.day_name()

mean_duration_by_hour = bicycle.groupby('start_hour')['duration'].mean()
mean_duration_by_day = bicycle.groupby('day_of_week')['duration'].mean()

Average riding time per hour¶

In [36]:
plt.figure(figsize=(6, 3))
sns.barplot(x=mean_duration_by_hour.index, y=mean_duration_by_hour.values, 
            hue=mean_duration_by_hour.index, 
            palette='muted', 
            legend=False)
plt.title('Average Ride Duration by Start Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Average Duration (minutes)')
plt.xticks(range(0, 24))
plt.grid(axis='y')

plt.show()

Riding times are longer during the late-night to early-morning hours (00:00–05:00), with the highest averages around 00:00 and 03:00. As morning approaches, the average riding time gradually decreases, which could be related to people commuting to work. Through the day (09:00 to 19:00) the average generally increases, especially in the afternoon; although it dips slightly in the evening, it remains relatively high.

Average riding time per day of the week¶

In [37]:
plt.figure(figsize=(6, 3))
ordered_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# reindex first so the bar labels and values stay aligned
mean_duration_by_day = mean_duration_by_day.reindex(ordered_days)
sns.barplot(x=mean_duration_by_day.index, y=mean_duration_by_day.values, 
            hue=mean_duration_by_day.index, 
            palette='Spectral', 
            legend=False)
plt.title('Average Ride Duration by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Duration (minutes)')
plt.grid(axis='y')

plt.show()

On weekends, people tend to ride for a longer time as compared to weekdays, which could be due to leisure activities or travel. On weekdays, the average ride time is shorter, and Tuesday has the shortest average ride time, which may be due to commuting patterns.
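The weekday/weekend gap could also be tested directly with a nonparametric two-sample test. A minimal sketch on synthetic data (the real `bicycle` frame, with its `day_of_week` and `duration` columns, would be substituted for the hypothetical `demo` frame):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the bicycle DataFrame: synthetic durations
# with longer weekend rides, for illustration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "day_of_week": rng.choice(
        ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"], size=2000),
})
is_weekend = demo["day_of_week"].isin(["Saturday", "Sunday"])
demo["duration"] = np.where(is_weekend,
                            rng.exponential(30, size=2000),
                            rng.exponential(20, size=2000))

# One-sided Mann-Whitney U: a nonparametric two-sample test, appropriate
# because ride-duration distributions are right-skewed.
stat, p = stats.mannwhitneyu(demo.loc[is_weekend, "duration"],
                             demo.loc[~is_weekend, "duration"],
                             alternative="greater")
print(f"U={stat:.0f}, p={p:.3g}")
```

A small p-value here would support the claim that weekend rides are systematically longer, rather than the bar chart difference being noise.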

Perform statistical tests to determine whether mean duration differs between passholder_type¶

The most common pass type is the "Monthly Pass"; the "Flex Pass" is extremely rare (only 6 trips).

In [38]:
bicycle['passholder_type'].value_counts()
Out[38]:
passholder_type
Monthly Pass    55962
Walk-up         18061
Annual Pass      6120
One Day Pass     3462
Flex Pass           6
Name: count, dtype: Int64

Show average riding time by passholder_type¶

In [39]:
import matplotlib.pyplot as plt
import seaborn as sns

mean_duration_by_passholder = bicycle.groupby('passholder_type')['duration'].mean()

plt.figure(figsize=(6, 4))

sns.barplot(x=mean_duration_by_passholder.index, 
            y=mean_duration_by_passholder.values, 
            hue=mean_duration_by_passholder.index, 
            palette='viridis', 
            legend=False)

plt.title('Mean Duration by Passholder Type')
plt.xlabel('Passholder Type')
plt.ylabel('Mean Duration (minutes)')

plt.tight_layout()
plt.show()

The users of "One Day Pass" and "Walk-up" have the longest average riding time. On the other hand, the "Annual Pass" and "Flex Pass" holders have the shortest average riding time.

To compare the averages across more than two groups, an ANOVA (Analysis of Variance) test is the standard choice. Before conducting it, however, the data must satisfy ANOVA's assumptions: normality and homogeneity of variance within each group.

Normality Test : Shapiro-Wilk¶

In [40]:
from scipy import stats

normality_test_results = {}
for pass_type in bicycle['passholder_type'].unique():
    duration_data = bicycle[bicycle['passholder_type'] == pass_type]['duration']
    normality_test_results[pass_type] = stats.shapiro(duration_data)[1]  # p-value

normality_test_results
/Users/Kate/anaconda3/lib/python3.10/site-packages/scipy/stats/_morestats.py:1882: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
Out[40]:
{'Monthly Pass': 0.0,
 'Walk-up': 0.0,
 'Annual Pass': 0.0,
 'Flex Pass': 0.5543804168701172,
 'One Day Pass': 5.135869678977664e-28}

All passholder types, except for "Flex Pass", have p-values that are either close to or equal to 0. This indicates that the ride time data for these pass types does not follow the assumption of a normal distribution. However, it is important to interpret these results with caution because the Shapiro-Wilk test can be too strict when dealing with large sample sizes. As for "Flex Pass", there is only a small amount of data available (only 6 transactions), so the test results may not be as reliable.
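One common workaround for the "N > 5000" warning is to run Shapiro-Wilk on a random subsample no larger than 5000. A sketch on synthetic right-skewed data standing in for one passholder group's durations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical stand-in for one group's duration data: a right-skewed
# sample larger than the 5000-observation limit the warning refers to.
duration_data = rng.exponential(scale=20.0, size=20000)

# Shapiro-Wilk p-values are only calibrated up to N = 5000,
# so test a random subsample instead of the full group.
subsample = rng.choice(duration_data, size=5000, replace=False)
p_value = stats.shapiro(subsample).pvalue
print(f"p = {p_value:.3g}")
```

For strongly skewed data like ride durations, the p-value remains tiny even on the subsample, so the non-normality conclusion is unchanged; the subsampling only removes the calibration caveat.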

Variance homogeneity test : Levene's Test¶

In [41]:
from scipy import stats

levene_test_result = stats.levene(
    *[bicycle[bicycle['passholder_type'] == pass_type]['duration'] for pass_type in bicycle['passholder_type'].unique()]
)[1]

levene_test_result
Out[41]:
0.0

The p-value is close to 0, indicating that the variation in riding time is inconsistent between different passholder_types, violating the assumption of homogeneity of variation.

The ANOVA test assumes normality and homogeneity of variance. Since the data does not fully meet these assumptions, the Kruskal-Wallis H test is recommended instead: a nonparametric method that compares multiple independent samples without requiring normality or equal variances.

Kruskal-Wallis H Test¶

In [42]:
kruskal_test_result = stats.kruskal(
    *[bicycle[bicycle['passholder_type'] == pass_type]['duration'] for pass_type in bicycle['passholder_type'].unique()]
)

kruskal_test_result
Out[42]:
KruskalResult(statistic=7295.39454547583, pvalue=0.0)

The Kruskal-Wallis H test (statistic = 7295.39, p-value close to 0) indicates a statistically significant difference in ride-duration distributions among the passholder types.

The analysis shows that different types of passholders, classified by their passholder_type, tend to have varying average ride times. This implies that some passholder types may use the bikes for longer or shorter durations. For instance, the analysis suggests that "One Day Pass" and "Walk-up" users have longer average ride times, whereas "Annual Pass" and "Flex Pass" users have shorter average ride times.
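Since Kruskal-Wallis only reports that *some* groups differ, a natural follow-up would be pairwise Mann-Whitney U tests with a multiple-comparison correction. A sketch on synthetic groups standing in for the duration-by-`passholder_type` split (Bonferroni is the simplest correction; Holm or Dunn's test would also be reasonable):

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical stand-in for durations grouped by passholder_type.
groups = {
    "Monthly Pass": rng.exponential(15, 500),
    "Walk-up": rng.exponential(30, 500),
    "One Day Pass": rng.exponential(30, 500),
}

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted significance threshold

results = {}
for a, b in pairs:
    results[(a, b)] = stats.mannwhitneyu(groups[a], groups[b]).pvalue

for (a, b), p in results.items():
    print(f"{a} vs {b}: p={p:.3g}{' *' if p < alpha else ''}")
```

This would identify exactly which passholder types differ from each other, rather than only the global difference the H test reports.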


Task 2: Seed Shape Analysis

Data Exploration¶

Load seeds dataset and analyze the distribution and characteristics of data through describe method¶

In [43]:
import pandas as pd

file_path = 'Datasets/seeds.csv'
seeds = pd.read_csv(file_path)

seeds.head()
Out[43]:
area perimeter compactness length width asymmetry groove_length
0 15.26 14.84 0.871 5.763 3.312 2.221 5.220
1 14.88 14.57 0.881 5.554 3.333 1.018 4.956
2 14.29 14.09 0.905 5.291 3.337 2.699 4.825
3 13.84 13.94 0.895 5.324 3.379 2.259 4.805
4 16.14 14.99 0.903 5.658 3.562 1.355 5.175
In [44]:
seeds.describe()
Out[44]:
area perimeter compactness length width asymmetry groove_length
count 210.000000 210.000000 210.000000 210.000000 210.000000 210.000000 210.000000
mean 14.847524 14.559286 0.871000 5.628533 3.258605 3.700200 5.408071
std 2.909699 1.305959 0.023594 0.443063 0.377714 1.503559 0.491480
min 10.590000 12.410000 0.808000 4.899000 2.630000 0.765000 4.519000
25% 12.270000 13.450000 0.857250 5.262250 2.944000 2.561500 5.045000
50% 14.355000 14.320000 0.873500 5.523500 3.237000 3.599000 5.223000
75% 17.305000 15.715000 0.887750 5.979750 3.561750 4.768750 5.877000
max 21.180000 17.250000 0.918000 6.675000 4.033000 8.456000 6.550000

Missing value check¶

In [45]:
seeds.isnull().sum()
Out[45]:
area             0
perimeter        0
compactness      0
length           0
width            0
asymmetry        0
groove_length    0
dtype: int64

Conduct data exploration¶

In [46]:
sns.set(style="whitegrid")
sns.pairplot(seeds)
plt.show()

There is a clear linear relationship between certain feature pairs, such as area and perimeter, and length and groove length, while other features, such as asymmetry, are more scattered and lack an obvious linear trend.

Show the distribution of each feature¶

In [47]:
plt.figure(figsize=(6, 4))
sns.boxplot(data=seeds)
plt.title("Seeds feature boxplot")
plt.xticks(rotation=45)
plt.show()
  • The position of the median in the box plot indicates different central trends for each feature.
  • Some characteristics, such as asymmetry, have extreme values, which may indicate individual seeds' specificity or data entry anomalies.

Show how different features are related to each other¶

In [48]:
corr_matrix = seeds.corr()

plt.figure(figsize=(6, 4))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Seed feature correlation coefficient heat map")
plt.show()

It has been observed that certain features, such as perimeter and area, length and groove length, etc., have a strong positive correlation with each other. This implies that these features are closely associated with one another. However, some features are less correlated with each other, such as asymmetry, which is not strongly correlated with other features.

Clustering¶

Based on the data exploration and observations above, I am considering either K-means or a Gaussian Mixture Model (GMM) for cluster analysis. Here are the reasons why these two methods are suitable:

K-means:

  • The scatter plot shows a linear relationship between many features, making K-means a good choice for processing such linearly distributed data.
  • The box plot indicates that most features have a relatively symmetric distribution, which means that K-means can effectively identify clusters formed by these features.
  • K-means is a relatively simple and computationally efficient clustering method, making it a good choice for small datasets with strong correlation between features.

Gaussian Mixture Model (GMM):

  • GMM assumes that the data is a mixture of several Gaussian distributions, which is suitable for datasets with continuous features, especially when the data distribution is complex (for example, some features have extreme values).
  • Unlike K-means, GMM provides a soft assignment, which is the probability that each data point belongs to a different cluster. This is particularly useful when working with overlapping data.
  • GMM is able to adapt to clusters of different shapes and sizes, which is an advantage for datasets with more complex relationships between features.
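The soft-assignment point can be made concrete: `GaussianMixture.predict_proba` returns each point's membership probability per component. A small sketch on two overlapping synthetic 1-D clusters (illustrative data, not the seeds set):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping 1-D Gaussian clusters, to illustrate soft assignment.
X = np.concatenate([rng.normal(0.0, 1.0, 200),
                    rng.normal(2.5, 1.0, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
proba = gmm.predict_proba(X)  # per-point membership probabilities
print(proba.shape)            # one row per point, one column per component
```

Points near the overlap region get probabilities close to 0.5/0.5, whereas K-means would force each of them into exactly one cluster.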

K-means¶

To determine the optimal K value for K-means, one can use the elbow method: compute the K-means cost (the within-cluster sum of squared distances, WCSS) over a range of K values, then look for the "elbow", the point after which the cost stops decreasing sharply. That K is considered the optimal value.

In [49]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# data normalization
scaler = StandardScaler()
seeds_scaled = scaler.fit_transform(seeds)
seeds_scaled_df = pd.DataFrame(seeds_scaled, columns=seeds.columns)

wcss = [] 
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, n_init=10)
    kmeans.fit(seeds_scaled)
    wcss.append(kmeans.inertia_)

# get the elbow
plt.figure(figsize=(6,4))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Rule')
plt.xlabel('k value')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()

Apply KMeans with K=3¶

In [50]:
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(seeds_scaled)

seeds_clustered = seeds.copy()
seeds_clustered['Cluster'] = y_kmeans

seeds_clustered.head()
Out[50]:
area perimeter compactness length width asymmetry groove_length Cluster
0 15.26 14.84 0.871 5.763 3.312 2.221 5.220 2
1 14.88 14.57 0.881 5.554 3.333 1.018 4.956 2
2 14.29 14.09 0.905 5.291 3.337 2.699 4.825 2
3 13.84 13.94 0.895 5.324 3.379 2.259 4.805 2
4 16.14 14.99 0.903 5.658 3.562 1.355 5.175 2

Plotting the clusters¶

Visualizing based on the two features area and perimeter.

In [51]:
clusters = y_kmeans  # reuse the labels fitted on the standardized data

plt.figure(figsize=(8, 5))
plt.scatter(seeds['area'], seeds['perimeter'], c=clusters, cmap='viridis', marker='x', alpha=0.7)
plt.title('K-means cluster result (Area vs Perimeter)')
plt.xlabel('Area')
plt.ylabel('Perimeter')
plt.colorbar(label='Cluster')
plt.show()

The dataset is divided into three clusters, each represented by a different color. The data points within each cluster are more closely grouped in terms of area and perimeter, indicating that these two features are significant in clustering. The boundaries between the clusters are distinct, demonstrating that K-means is capable of accurately distinguishing the data. Based on these characteristics, the results of this analysis suggest that K-means clustering is a suitable method for classifying the seed dataset.

Visualize K-means clustering results using principal component analysis (PCA)¶

In [52]:
from sklearn.decomposition import PCA

# reduce the data to two principal components
pca = PCA(n_components=2)
seeds_pca = pca.fit_transform(seeds)

plt.figure(figsize=(8, 5))
plt.scatter(seeds_pca[:, 0], seeds_pca[:, 1], c=clusters, cmap='viridis', marker='o')
plt.title('K-means cluster result (PCA downgrade the dimension)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Cluster')
plt.show()

Each point in this 2D space represents a seed sample, colored by the cluster it belongs to. The clustering structure is well-maintained in the dimensionality reduction process, making it easy to distinguish the three clusters. Principal component analysis effectively captures the primary variability of the data, allowing us to observe the cluster distribution in a 2D space.
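It is worth checking how much variance the two components actually retain before trusting the 2-D picture; `explained_variance_ratio_` reports this. A sketch on synthetic correlated features standing in for the seeds matrix (`X` here is illustrative data, not the real seeds):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for the seeds features: 7 columns driven by a
# rank-2 latent structure plus small noise, mimicking the strong
# correlations seen in the heat map.
base = rng.normal(size=(210, 2))
X = np.hstack([base,
               base @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(210, 5))])

pca = PCA(n_components=2).fit(X)
# Fraction of the total variance captured by each retained component.
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())
```

If the two components together explain most of the variance, the 2-D scatter is a faithful summary; if not, apparent cluster separation in the plot should be interpreted with caution.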

Calculate silhouette score¶

The silhouette score ranges from -1 to 1, with higher values indicating better clustering quality.

In [53]:
from sklearn.metrics import silhouette_score

# test k= 2,3,4
k_values = range(2, 5)
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(seeds_scaled)
    labels = kmeans.labels_
    silhouette_scores.append(silhouette_score(seeds_scaled, labels))

k_silhouette_compare = list(zip(k_values, silhouette_scores))
k_silhouette_compare
Out[53]:
[(2, 0.46580082738022255), (3, 0.40064436862413266), (4, 0.32634482280657695)]

Clustering quality, as measured by the silhouette score, is best at k=2 (0.466), slightly ahead of k=3 (0.401).

Gaussian Mixture Models¶

Use 3 as the initial number of components, based on the PCA visualization and the elbow method.

In [54]:
from sklearn.mixture import GaussianMixture

# data normalization
scaler = StandardScaler()
data_scaled = scaler.fit_transform(seeds)

# choose 3 as cluster number
n_clusters = 3
gmm = GaussianMixture(n_components=n_clusters, random_state=42)
gmm.fit(data_scaled)

# result prediction
labels = gmm.predict(data_scaled)

plt.scatter(data_scaled[:, 0], data_scaled[:, 1], c=labels, cmap='viridis', marker='x')
plt.xlabel('area (standardized)')
plt.ylabel('perimeter (standardized)')
plt.title('GMM Cluster Result')
plt.show()

AIC and BIC scores corresponding to different cluster numbers¶

In [55]:
from sklearn.mixture import GaussianMixture

# data normalization
scaler = StandardScaler()
data_scaled = scaler.fit_transform(seeds)

aic_values = []
bic_values = []
range_of_clusters = range(2, 11)

for n_clusters in range_of_clusters:
    gmm = GaussianMixture(n_components=n_clusters, random_state=0)
    gmm.fit(data_scaled)
    aic_values.append(gmm.aic(data_scaled))
    bic_values.append(gmm.bic(data_scaled))

plt.figure(figsize=(6, 4))
plt.plot(range_of_clusters, aic_values, label='AIC', marker='o')
plt.plot(range_of_clusters, bic_values, label='BIC', marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Scores')
plt.title('AIC and BIC Scores for Different Number of Clusters')
plt.legend()
plt.grid(True)
plt.show()

This chart suggests that 2 clusters may be the best choice for this set of data, since at this number both AIC and BIC reach relatively low values.

Compute Silhouette Scores¶

In [56]:
from sklearn.mixture import GaussianMixture

range_of_clusters = range(2, 5)
silhouette_scores = []

for n_clusters in range_of_clusters:
    gmm = GaussianMixture(n_components=n_clusters, random_state=0)
    labels = gmm.fit_predict(seeds)      # note: fitted on unscaled features here
    score = silhouette_score(seeds, labels)
    silhouette_scores.append(score)

# evaluate
eval_clusters = range_of_clusters[silhouette_scores.index(max(silhouette_scores))]
eval_clusters, max(silhouette_scores)
Out[56]:
(2, 0.48772862024342833)

The silhouette coefficient is approximately 0.488, indicating moderate clustering quality with some separation between clusters, but potential for improvement.

Visualize the correlation of 2 clusters¶

In [57]:
optimal_clusters = 2
gmm = GaussianMixture(n_components=optimal_clusters, random_state=0)
labels = gmm.fit_predict(seeds)

data_with_clusters = seeds.copy()
data_with_clusters['Cluster'] = labels

sns.pairplot(data_with_clusters, hue='Cluster', palette='viridis')
plt.suptitle("GMM Cluster Analysis of Seeds Data", y=1.02)
plt.show()

The results of GMM clustering are represented by different colors for different clusters. If the separation between clusters is clearly visible, it usually indicates good clustering. However, if the boundaries between clusters are blurred, it may indicate that the clustering effect is average.

GMM is preferred for the following reasons:

  • GMM may perform better than K-Means if clusters are not circular or spherical, or if they differ in size and density.

  • When data set clusters overlap, GMM's soft clustering captures overlaps better than K-Means' hard clustering, which can lead to unclear classification.

  • Although GMM is computationally more expensive than K-Means, the dataset is small enough that training cost is negligible.

Compare the source dataset to the provided seeds.csv¶

In [58]:
# load data to see the file format
txt_file_path = 'Datasets/seeds_dataset.txt'

with open(txt_file_path, 'r') as file:
    seeds_txt_content = file.read()

seeds_txt_content[:500]
Out[58]:
'15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1\n14.88\t14.57\t0.8811\t5.554\t3.333\t1.018\t4.956\t1\n14.29\t14.09\t0.905\t5.291\t3.337\t2.699\t4.825\t1\n13.84\t13.94\t0.8955\t5.324\t3.379\t2.259\t4.805\t1\n16.14\t14.99\t0.9034\t5.658\t3.562\t1.355\t5.175\t1\n14.38\t14.21\t0.8951\t5.386\t3.312\t2.462\t4.956\t1\n14.69\t14.49\t0.8799\t5.563\t3.259\t3.586\t5.219\t1\n14.11\t14.1\t0.8911\t5.42\t3.302\t2.7\t\t5\t\t1\n16.63\t15.46\t0.8747\t6.053\t3.465\t2.04\t5.877\t1\n16.44\t15.25\t0.888\t5.884\t3.505\t1.969\t5.533\t1\n15.26\t14.85\t0.8696\t5.714\t3.242\t4.543\t5.314\t1\n14.03\t14.16\t0.87'

The original dataset includes a label field that is not present in the provided dataset. This label field probably indicates the actual type or category of the seed, which is helpful for supervised learning tasks.

In [59]:
# add column names and save new file
file_path = 'Datasets/seeds_dataset.txt'
column_names = ["area", "perimeter", "compactness", "length", "width", "asymmetry", "groove_length", "label"]
seeds_source = pd.read_csv(file_path, sep=r'\s+', header=None, names=column_names)

output_path = 'Datasets/seeds_source.csv'
seeds_source.to_csv(output_path, index=False) 
In [60]:
# read the source data
file_path = 'Datasets/seeds_source.csv'
source_seeds = pd.read_csv(file_path)

source_seeds.head()
Out[60]:
area perimeter compactness length width asymmetry groove_length label
0 15.26 14.84 0.8710 5.763 3.312 2.221 5.220 1
1 14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1
2 14.29 14.09 0.9050 5.291 3.337 2.699 4.825 1
3 13.84 13.94 0.8955 5.324 3.379 2.259 4.805 1
4 16.14 14.99 0.9034 5.658 3.562 1.355 5.175 1

The additional label field in the original data set provides a possible true cluster structure and can be used to evaluate the accuracy of the clustering results after the analysis is complete.

Re-evaluate¶

Access to the actual labels allows for assessment of the quality of the clustering by comparing the results with the real labels. This can be achieved through various methods such as confusion matrix, purity, Rand index.

In [61]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics.cluster import rand_score

# Separate the features and the label
X = source_seeds.drop('label', axis=1)
y_true = source_seeds['label']
n_clusters = len(y_true.unique()) # 3

# standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
y_pred_scaled = kmeans.fit_predict(X_scaled)

# confusion matrix
conf_matrix_scaled = confusion_matrix(y_true, y_pred_scaled)

# heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix_scaled, annot=True, fmt="d", cmap="Blues", 
            xticklabels=True, yticklabels=True)
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()

Although most data points are correctly classified, some are still misclassified in the confusion matrix.

Rand index¶

In [62]:
rand_index_scaled = rand_score(y_true, y_pred_scaled)
rand_index_scaled
Out[62]:
0.8997038049669629

The Rand index is 0.8997, indicating good agreement between clustering results and actual labels.
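Since the raw Rand index does not correct for chance agreement (it can be high even for near-random partitions), the adjusted Rand index (ARI) is worth reporting alongside it. A toy illustration of the difference:

```python
from sklearn.metrics.cluster import rand_score, adjusted_rand_score

# Imperfect clustering of nine points in three true classes.
y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2]

ri = rand_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
print(ri, ari)  # ARI is lower: it subtracts the expected chance agreement

# A label permutation of a perfect clustering still scores ARI = 1.0.
perfect = adjusted_rand_score([1, 1, 2, 2], [7, 7, 5, 5])
```

Applying `adjusted_rand_score(y_true, y_pred_scaled)` to the K-means labels above would give a more conservative agreement measure than the 0.8997 raw Rand index.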

Calculate purity¶

In [63]:
def calculate_purity(conf_matrix):
    cluster_max = np.sum(np.amax(conf_matrix, axis=0))
    total_points = np.sum(conf_matrix)
    purity = cluster_max / total_points
    return purity

purity = calculate_purity(conf_matrix_scaled)
purity
Out[63]:
0.919047619047619

On a scale of 0 to 1, a purity value of 0.9190 indicates that the clustering results match the true labels fairly well. Note, however, that purity ignores the label distribution: it can be driven high simply by increasing the number of clusters, so it should be read alongside the Rand index.
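The caveat about purity can be demonstrated directly: because purity takes the majority label in each cluster, a degenerate clustering with one point per cluster scores a perfect 1.0. A sketch using `contingency_matrix`, which is equivalent to the confusion-matrix-based helper above:

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def purity(y_true, y_pred):
    # Majority label per cluster, summed, over the total number of points.
    cm = contingency_matrix(y_true, y_pred)
    return cm.max(axis=0).sum() / cm.sum()

y_true = [1, 1, 2, 2, 3, 3]
y_singletons = [0, 1, 2, 3, 4, 5]    # every point in its own cluster
print(purity(y_true, y_singletons))  # perfect purity, yet uninformative
```

This is why purity is only meaningful when the number of clusters is fixed and comparable to the number of true classes, as it is here (3 vs 3).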

Classification¶

Support Vector Machines (SVM) are effective for handling nonlinear and high-dimensional data by utilizing various kernel functions to handle linearly inseparable data.

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

y = source_seeds['label']

# split data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# set SVM with default params
svm_model = SVC(kernel='rbf', C=1.0, gamma='auto') 
#svm_model = SVC(kernel='linear', C=100, gamma='scale') 

# train
svm_model.fit(X_train_scaled, y_train)

# prediction
y_pred = svm_model.predict(X_test_scaled)

# result
report = classification_report(y_test, y_pred)
print(report)
              precision    recall  f1-score   support

           1       0.86      0.90      0.88        20
           2       1.00      0.95      0.98        21
           3       0.91      0.91      0.91        22

    accuracy                           0.92        63
   macro avg       0.92      0.92      0.92        63
weighted avg       0.92      0.92      0.92        63

The result above is a performance evaluation report of SVM classifier.

  • Precision and recall are high overall; for class 1, for example, the model correctly identifies 90% of its samples (recall = 0.90).

  • The F1 score for the first class is 0.88, indicating a relatively balanced precision and recall for this class.

  • The macro average is 0.92, indicating good overall model performance.

  • The weighted average is 0.92, which indicates that the model performed well across all classes, despite differences in sample size. This is a good sign of the model's overall performance.

The SVM classifier has performed well on this dataset, achieving scores of 0.9 or higher on all metrics, which indicates that it classifies the seed categories accurately with a low error rate. Nevertheless, scores this high warrant checking for overfitting.

Use cross-validation to evaluate the performance of SVM models¶

In [65]:
from sklearn.model_selection import cross_val_score
accuracy_scores = cross_val_score(svm_model, X_scaled, y, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (accuracy_scores.mean(), accuracy_scores.std() * 2))
Accuracy: 0.91 (+/- 0.16)

The cross-validation results indicate that the SVM model performs well on this dataset, with a mean accuracy of 0.91 and a standard deviation of 0.08 (the reported ±0.16 is two standard deviations). There is therefore some variability in accuracy across the different train/test splits. The mean plus or minus two standard deviations can be read as a rough confidence interval for the model's performance.

The relatively wide confidence interval suggests that different data splits may bring certain changes to the model's performance. This could be due to some degree of heterogeneity within the dataset, or it could be that the dataset itself is small, resulting in large differences on different cross-validation splits.

Overall, the model demonstrates strong classification ability, but the cross-validation variability suggests that we may need to consider using more data to improve its generalization ability.

Perform grid search and cross-validation to find the best combination of SVM parameters¶

In [66]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# set the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# set the param grid
param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.01, 0.1, 1],
    'svm__kernel': ['rbf', 'poly', 'linear'] 
}

# init grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

# result
best_parameters = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best Parameters: {best_parameters}")
print(f"Best Score: {best_score:.4f}")
Best Parameters: {'svm__C': 100, 'svm__gamma': 'scale', 'svm__kernel': 'linear'}
Best Score: 0.9476

Pros:

  • Using 'scale' for gamma adjusts its value based on feature variance, resulting in better generalization than a fixed value.

  • SVM models that use linear kernels are relatively easy to interpret. The model's decision-making process is easier to understand, and feature weights can be directly viewed to assess feature importance.

Cons:

  • Although increasing the C value can improve the model's fit to the training data, it can also lead to overfitting, especially when the dataset is small or the number of features is large.

  • Training SVMs, especially with nonlinear kernels, can be time-consuming for large datasets.

To summarize, the grid search results indicate that SVM with a linear kernel is a suitable choice for this dataset, as it exhibits high accuracy and good interpretability. It is important to monitor the model's performance on new data to prevent overfitting and make necessary adjustments in a timely manner.
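The interpretability claim can be checked directly: for a linear kernel, `SVC.coef_` exposes one weight vector per one-vs-one class pair, so feature importance can be read off. A sketch on the iris dataset as a stand-in for the labelled seeds data (names such as `model` are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Fit a linear-kernel SVM with the grid-search-selected C=100,
# on iris (3 classes, 4 features) as a stand-in for the seeds data.
X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=100))
model.fit(X, y)

# One weight vector per one-vs-one class pair: shape
# (n_classes * (n_classes - 1) / 2, n_features).
coefs = model.named_steps["svc"].coef_
print(coefs.shape)
```

Large absolute weights flag the features that drive each pairwise decision boundary, which is exactly the interpretability advantage a linear kernel offers over `rbf` or `poly`.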

SVM is preferred for the following reasons:

  • SVM can handle nonlinear relationships by using kernel techniques to map data into a higher-dimensional space and find linearly separable boundaries.

  • For data sets where the sample size is not very large, SVM can usually achieve good performance.

  • The SVM regularization feature enables users to control overfitting by adjusting the regularization parameter C to maximize the decision boundary margin while minimizing error.

  • Compared to decision trees or random forests, SVM's margin maximization gives it some built-in resistance to overfitting, especially when there are more features than samples.


References¶

  • VanderPlas, Jake (2022). Python Data Science Handbook. O'Reilly Media, Inc.
  • Pang-Ning Tan; Michael Steinbach; Anuj Karpatne; Vipin Kumar (2019). Introduction to Data Mining. Pearson.